Last updated: 2023-08-01

Checks: 5 1

Knit directory: WenjunLiu_Thesis_Chapter4/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200930) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Tracking code development and connecting the code version to the results is critical for reproducibility. To start using Git, open the Terminal and type git init in your project directory.


This project is not being versioned with Git. To obtain the full reproducibility benefits of using workflowr, please see ?wflow_start.


library(tidyverse)
library(yaml)
library(scales)
library(pander)
library(glue)
library(edgeR)
library(AnnotationHub)
library(ensembldb)
library(cowplot)
library(ggfortify)
library(magrittr)
library(cqn)
library(ggrepel)
library(DT)
# library(variancePartition)
library(BiocParallel)
library(UpSetR)
library(corrplot)
library(singscore)
library(msigdbr)
library(broom)
library(goseq)
library(WGCNA)
library(reactable)
library(ggforce)
panderOptions("table.split.table", Inf)
panderOptions("big.mark", ",")
theme_set(theme_bw())
config <- here::here("config/config.yml") %>%
  read_yaml()
suffix <- paste0(config$tag)
sp <- config$ref$species %>%
  str_replace("(^[a-z])[a-z]*_([a-z]+)", "\\1\\2") %>%
  str_to_title()

Setup

Gene Annotations

ah <- AnnotationHub() %>%
  subset(rdataclass == "EnsDb") %>%
  subset(str_detect(description, as.character(config$ref$release))) %>%
  subset(genome == config$ref$build)
stopifnot(length(ah) == 1)
ensDb <- ah[[1]]
genesGR <- read_rds(here::here("output/genesGR.rds"))

Gene annotations were again loaded from Ensembl Release 101. The previously defined GenomicRanges object containing GC content and Gene Length was also loaded, containing information for 67,990 genes.

Sample Merging

Dominant cell types (stroma, epithelial, ducts or fat) at the time of tissue collection were recorded and this was read in.

init_cellType <- read_delim(here::here("config/sample_meta.txt"), delim = "\t")
init_cellType <- init_cellType %>%
    mutate(Stroma = ifelse(str_detect(Dominant_cell_type, regex("Stroma", ignore_case = T)), TRUE, FALSE), 
           Epithelial = ifelse(str_detect(Dominant_cell_type, regex("Epithelial", ignore_case = T)), TRUE, FALSE), 
           Ducts = ifelse(str_detect(Dominant_cell_type, regex("ducts", ignore_case = T)), TRUE, FALSE), 
           Fat = ifelse(str_detect(Dominant_cell_type, regex("fat", ignore_case = T)), TRUE, FALSE), 
           patient = str_replace(patient, "TH", "TH-")) %>%
    pivot_longer(c("Stroma", "Epithelial", "Ducts", "Fat"),
                 names_to = "cell_type", 
                 values_to = "TF") %>%
    mutate(cell_type = ifelse(TF, cell_type, NA)) %>%
    dplyr::select(-c("TF", "Dominant_cell_type")) %>%
    .[!is.na(.$cell_type),] %>%
    chop("cell_type") %>%
    mutate(cell_type = vapply(.$cell_type, function(x){
        paste(x,collapse = ";")
    }, character(1)))
samples <- config$samples %>%
  here::here() %>%
  read_tsv() %>%
    left_join(init_cellType) %>%
  mutate(
    Filename = paste0(sample, ".r_1"), 
    condition = ifelse(Tumor, 
                       paste("Tumor", treat, sep = "_"), 
                       paste("Normal", treat, sep = "_")), 
    patient = vapply(.$patient, function(x){str_split(x, "-")[[1]][2]}, character(1)), 
    patient = ifelse(Tumor, 
                       paste("Tumor", patient, sep = "-"), 
                       paste("Normal",patient, sep = "-")), 
    desc = paste(patient, treat, sep = " ")
  ) %>%
  dplyr::select(-c("name", "sample")) %>%
  dplyr::rename(name = desc) %>%
  mutate_if(
    function(x){length(unique(x)) < length(x)},
    as.factor
  ) %>%
  mutate(
    treat = relevel(treat, ref = "Veh")
  )

Sample metadata was split to a normal and a primary malignant tumour group.

mergedSamples <- samples %>%
  group_by(name, patient, treat, Tumor, cell_type, Tissue_type, Age, Diagnosis) %>%
  tally()
normal_sample <- mergedSamples %>%
  dplyr::filter(Tumor == FALSE) %>%
    droplevels()
tumor_sample <- mergedSamples %>%
  dplyr::filter(Tumor == TRUE) %>%
    droplevels()

While we have tumours collected from 8 patients and 4 treatment arms, one DHT-treated sample was lost (i.e Tumor -8). For all downstream analyses, we will focus on the 31 ER-positive primary malignant tumours.

pander(
  mergedSamples %>%
      dplyr::filter(Tumor == TRUE),
  caption = "*Summary of sequencing total runs for all tumour samples*"
)
Summary of sequencing total runs for all tumour samples
name patient treat Tumor cell_type Tissue_type Age Diagnosis n
Tumor-1 DHT Tumor-1 DHT TRUE Epithelial Malignant 41 invasive carcinoma of no special type 8
Tumor-1 E2 Tumor-1 E2 TRUE Epithelial Malignant 41 invasive carcinoma of no special type 8
Tumor-1 E2+DHT Tumor-1 E2+DHT TRUE Epithelial Malignant 41 invasive carcinoma of no special type 8
Tumor-1 Veh Tumor-1 Veh TRUE Epithelial Malignant 41 invasive carcinoma of no special type 8
Tumor-2 DHT Tumor-2 DHT TRUE Stroma;Epithelial Malignant 78 infiltrating carcinoma of no special type 8
Tumor-2 E2 Tumor-2 E2 TRUE Stroma;Epithelial Malignant 78 infiltrating carcinoma of no special type 8
Tumor-2 E2+DHT Tumor-2 E2+DHT TRUE Stroma;Epithelial Malignant 78 infiltrating carcinoma of no special type 8
Tumor-2 Veh Tumor-2 Veh TRUE Stroma;Epithelial Malignant 78 infiltrating carcinoma of no special type 8
Tumor-3 DHT Tumor-3 DHT TRUE Epithelial Malignant 65 infiltrating carcinoma of no special type 8
Tumor-3 E2 Tumor-3 E2 TRUE Epithelial Malignant 65 infiltrating carcinoma of no special type 8
Tumor-3 E2+DHT Tumor-3 E2+DHT TRUE Epithelial Malignant 65 infiltrating carcinoma of no special type 8
Tumor-3 Veh Tumor-3 Veh TRUE Epithelial Malignant 65 infiltrating carcinoma of no special type 8
Tumor-4 DHT Tumor-4 DHT TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8
Tumor-4 E2 Tumor-4 E2 TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8
Tumor-4 E2+DHT Tumor-4 E2+DHT TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8
Tumor-4 Veh Tumor-4 Veh TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8
Tumor-5 DHT Tumor-5 DHT TRUE Stroma Malignant 48 infiltrating carcinoma of no special type 8
Tumor-5 E2 Tumor-5 E2 TRUE Stroma Malignant 48 infiltrating carcinoma of no special type 8
Tumor-5 E2+DHT Tumor-5 E2+DHT TRUE Stroma Malignant 48 infiltrating carcinoma of no special type 8
Tumor-5 Veh Tumor-5 Veh TRUE Stroma Malignant 48 infiltrating carcinoma of no special type 8
Tumor-6 DHT Tumor-6 DHT TRUE Epithelial Malignant 43 invasive carcinoma of no special type 8
Tumor-6 E2 Tumor-6 E2 TRUE Epithelial Malignant 43 invasive carcinoma of no special type 8
Tumor-6 E2+DHT Tumor-6 E2+DHT TRUE Epithelial Malignant 43 invasive carcinoma of no special type 8
Tumor-6 Veh Tumor-6 Veh TRUE Epithelial Malignant 43 invasive carcinoma of no special type 8
Tumor-7 DHT Tumor-7 DHT TRUE Stroma Malignant 53 invasive carcinoma of no special type 8
Tumor-7 E2 Tumor-7 E2 TRUE Stroma Malignant 53 invasive carcinoma of no special type 8
Tumor-7 E2+DHT Tumor-7 E2+DHT TRUE Stroma Malignant 53 invasive carcinoma of no special type 8
Tumor-7 Veh Tumor-7 Veh TRUE Stroma Malignant 53 invasive carcinoma of no special type 8
Tumor-8 E2 Tumor-8 E2 TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8
Tumor-8 E2+DHT Tumor-8 E2+DHT TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8
Tumor-8 Veh Tumor-8 Veh TRUE Epithelial Malignant 76 invasive carcinoma of no special type 8

This data set is slightly unique in that samples were sequenced across multiple lanes and flowcells. Metadata was re-organised to enable summarising of counts across the multiple sequencing runs obtained for every sample. The initial PCA showed that variability between replicates and flowcells was dwarfed by variability between samples and whether it was tumor. As such, all counts were able to be merged to give a single set of counts for each individual sample.

treat_cols <- c(
  Veh = rgb(0.7, 0.7, 0.7),
  DHT = rgb(0.8, 0.2, 0.2),
  E2 = rgb(0.2, 0.2, 0.8),
  `E2+DHT` = rgb(1, 0.4, 1)
)
patient_cols <- hcl.colors(
  n = length(levels(tumor_sample$patient)), 
  palette = "Spectral"
  ) %>%
  setNames(levels(tumor_sample$patient))
age_cols <- hcl.colors(
  n = 3, 
  palette = "Zissou 1"
  ) %>%
  setNames(c("<30", "<50", ">=50"))
treat_shapes <- c(
  Veh = 1,
  DHT = 19,
  E2 = 15,
  `E2+DHT` = 17
)

Count Data

# counts <- here::here("data/aligned/counts/counts.out.gz") %>%
#   gzfile() %>%
#   read_tsv(comment = "#") %>%
#   dplyr::select(Geneid, ends_with("bam")) %>%
#   rename_at(vars(ends_with("bam")), dirname) %>%
#   rename_all(basename) %>%
#   column_to_rownames("Geneid")
# mergedCounts <- counts %>%
#   rownames_to_column("gene_id") %>%
#   pivot_longer(
#     cols = -gene_id,
#     names_to = "Filename",
#     values_to = "counts"
#   ) %>%
#   left_join(samples, by = "Filename") %>%
#   group_by(
#     gene_id, name, patient, treat, Tumor) %>%
#   summarise(counts = sum(counts), .groups = "drop") %>%
#   pivot_wider(
#     id_cols = gene_id,
#     values_from = counts,
#     names_from = name
#   ) %>%
#   column_to_rownames("gene_id")
# saveRDS(mergedCounts, here::here("data/mergedCounts.rds"))
mergedCounts <- readRDS(here::here("data/mergedCounts.rds"))
rm(counts)
rm(samples)
gc()
           used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells 10956201 585.2   16488054 880.6         NA 16488054 880.6
Vcells 26362090 201.2   38677362 295.1     204800 32164469 245.4
minCPM <- 1.5
minSamples_tumor <- 8
genes2Keep_tumor <- mergedCounts[,str_subset(colnames(mergedCounts), "Tumor")] %>%
  edgeR::cpm() %>%
  is_greater_than(minCPM) %>%
  rowSums() %>%
  is_weakly_greater_than(minSamples_tumor)

\(>\) 1.5 counts per million (CPM) were required to observed in \(\geq\) 8 samples among tumor samples for a gene to be considered as detected.

Of the 60,671 genes contained in the annotation for this release, 42,234 genes were removed as failing this criteria for detection among tumor tissues, leaving 18,437 genes for downstream analysis.

a2 <- mergedCounts[,str_subset(colnames(mergedCounts), "Tumor")] %>%
  edgeR::cpm(log = TRUE) %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  as_tibble() %>%
  pivot_longer(
    cols = contains("Tumor"),
    names_to = "name",
    values_to = "logCPM"
  ) %>%
  left_join(tumor_sample) %>%
  ggplot(aes(logCPM, stat(density), colour = patient, linetype = treat)) +
  geom_density() +
  scale_colour_manual(values= patient_cols[str_subset(names(patient_cols), "Tumor")]) +
  labs(
    y = "Density",
    colour = "Patient",
    linetype = "Treatment"
  )
b2 <- mergedCounts[genes2Keep_tumor,str_subset(colnames(mergedCounts), "Tumor")] %>%
  edgeR::cpm(log = TRUE) %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  as_tibble() %>%
  pivot_longer(
    cols = contains("Tumor"),
    names_to = "name",
    values_to = "logCPM"
  ) %>%
  left_join(tumor_sample) %>%
  ggplot(aes(logCPM, stat(density), colour = patient, linetype = treat)) +
  geom_density() +
  scale_colour_manual(values= patient_cols[str_subset(names(patient_cols), "Tumor")]) +
  labs(
    y = "Density",
    colour = "Patient",
    linetype = "Treatment"
  )
plot_grid(
    a2 + theme(legend.position = "none"), 
    b2, 
    labels = c("A", "B"),
    ncol = 2, 
    rel_widths = c(0.4, 0.6))
*Distributions of logCPM values on merged counts, A) before and B) after filtering of undetectable genes for the 31 tumor sampels. Some differences between patients were noted.*

Distributions of logCPM values on merged counts, A) before and B) after filtering of undetectable genes for the 31 tumor sampels. Some differences between patients were noted.

A dgeListwas formed for the tumour samples.

dge_tumor <- mergedCounts[genes2Keep_tumor,colnames(mergedCounts) %in% tumor_sample$name] %>%
  DGEList(
    samples = tumor_sample %>%
      as.data.frame() %>%
      set_rownames(.$name) %>%
      .[,colnames(.)],
    group = tumor_sample %>%
      as.data.frame() %>%
      set_rownames(.$name) %>%
      .[,colnames(.)] %>%
      pull(treat),
    genes = genesGR[rownames(.)]
  ) %>%
  calcNormFactors()

QC

Library Sizes

dge_tumor$samples %>%
    ggplot(aes(treat, lib.size, fill = treat)) +
    geom_col() +
    geom_hline(yintercept = 1e7, linetype = 2) +
    facet_wrap(~patient, ncol = 5) +
    scale_y_continuous(
        labels = comma, expand = expansion(c(0, 0.05))
    ) +
    scale_fill_manual(values= treat_cols) +
    labs(x = "Sample Name", y = "Library Size", fill = "Treatment")  +
    theme(legend.position = c(11/12, 1/6))
*Library sizes of all tumour samples after removal of undetectable genes. The common-use minimum library size of 10 million reads is shown as a dashed line.*

Library sizes of all tumour samples after removal of undetectable genes. The common-use minimum library size of 10 million reads is shown as a dashed line.

Given that previous work had only assessed the libraries prior to merging, the library sizes were checked after merging patient sequencing runs. The median library size was found to be 12.9 million reads for tumor samples which are just above the common-use minimum recommendation of 10 million reads/sample.

PCA: Pre Normalisation

A PCA was performed on all the tumour samples.

pca_tumor <- dge_tumor %>%
  edgeR::cpm(log = TRUE) %>%
  t() %>%
  prcomp() 

No common direction based on treatment appears evident with each patient, and great inter-patient heterogenity was observed even among the control samples.

g1 <- pca_tumor %>%
    autoplot(
        data = dge_tumor$samples, 
        x = 1, y = 2,
        colour = "treat", 
        shape = "treat",
        size = 3
    ) +
    facet_wrap(~patient, ncol = 4) +
    scale_colour_manual(values= treat_cols) +
    scale_shape_manual(values = treat_shapes) +
    labs(
        colour = "Treatment",
        shape = "Treatment"
    )
g2 <- pca_tumor %>%
    autoplot(
        data = dge_tumor$samples, 
        x = 1, y = 2,
        colour = "treat", 
        shape = "treat",
        size = 3
    ) +
    scale_colour_manual(values= treat_cols) +
    scale_shape_manual(values = treat_shapes) +
    labs(
        colour = "Treatment",
        shape = "Treatment"
    )
plot_grid(g1, g2, 
          ncol = 2, 
          rel_widths = c(1.8,1), 
          scale = c(1, 0.8), 
          labels = c("(a)", "(b)"))
*PCA on logCPM from merged counts for (a) each of, (b) all of the 8 tumor patients. No common direction based on treatment appears evident with each patient, and great inter-patient heterogenity was observed even among the control samples.*

PCA on logCPM from merged counts for (a) each of, (b) all of the 8 tumor patients. No common direction based on treatment appears evident with each patient, and great inter-patient heterogenity was observed even among the control samples.

Checks for GC and Length bias

The impact of GC content or length bias was assessed as a possible source.

Genes were divided in 10 approximately equal sized bins based on increasing length, and 10 approximately equal sized bins based on increasing GC content, with the final GC/Length bins being the combination 100 bins using both sets. The contribution of each gene to PC1 and PC2 was assessed and a t-test performed on each bin. This tests

\[ H_0: \mu = 0 \text{ against } H_A: \mu \neq 0 \]

where \(\mu\) represents the true contribution to PC1 of all genes in that bin.

If any bin makes a contribution to PC1 the mean will be clearly non-zero, whilst if there is no contribution the mean will be near zero. In this way, the impact of gene length and GC content on variance within the dataset can be assessed.

As seen below, the contribution of GC content and gene length to PC1 is very clear in tumor samples, with a smaller contribution being evident across PC1. As a result, Conditional Quantile Normalisation (CQN) is recommended in preference to the more common TMM normalisation.

dge_tumor$genes %>%
  dplyr::select(gene_id, ave_tx_len, gc_content) %>%
  mutate(
    GC = cut(
      x = gc_content,
      labels = seq_len(10),
      breaks = quantile(gc_content, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    Length = cut(
      x = ave_tx_len,
      labels = seq_len(10),
      breaks = quantile(ave_tx_len, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    bin = paste(GC, Length, sep = "_"),
    PC1 = pca_tumor$rotation[gene_id, "PC1"],
    PC2 = pca_tumor$rotation[gene_id, "PC2"]
  ) %>%
  pivot_longer(
    cols = c("PC1", "PC2"),
    names_to = "PC",
    values_to = "value"
  ) %>%
  group_by(PC, GC, Length, bin) %>%
  summarise(
    Size = n(),
    mean = mean(value),
    sd = sd(value),
    t = t.test(value)$statistic,
    p = t.test(value)$p.value,
    adjP = p.adjust(p, method = "bonf")
  ) %>%
  ggplot(
    aes(Length, GC, colour = t, alpha = -log10(adjP), size = Size)
  ) +
  geom_point() +
  facet_wrap(~PC) +
  scale_colour_gradient2() +
  scale_size_continuous(range = c(1, 10)) +
  labs(alpha = expression(paste(-log[10], p))) +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom"
  ) 
*Contribution of each GC/Length Bin to PC1 and PC2 among tumor samples. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values. The number of genes in each bin is indicated by the circle size. The clear pattern across PC1 is unambiguous.*

Contribution of each GC/Length Bin to PC1 and PC2 among tumor samples. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values. The number of genes in each bin is indicated by the circle size. The clear pattern across PC1 is unambiguous.

Normalisation

cqNorma_tumor <- with(
  dge_tumor,
  cqn(
    counts= counts,
    x = genes$gc_content,
    lengths = genes$ave_tx_len
  )
)
dge_tumor$offset <- cqNorma_tumor$glm.offset
logCPM_tumor <- cqNorma_tumor$y + cqNorma_tumor$offset
# saveRDS(dge_tumor, here::here("output/dge_tumor.rds"))
# saveRDS(logCPM_tumor, here::here("output/logCPM_tumor.rds"))
a2 <- cqNorma_tumor$func1 %>%
    as.data.frame() %>%
    mutate(x = cqNorma_tumor$grid1) %>%
    pivot_longer(
        cols = any_of(colnames(dge_tumor)),
        names_to = "name",
        values_to = "QR fit"
    ) %>%
    left_join(dge_tumor$samples) %>%
    ggplot(
        aes(x, `QR fit`, colour = patient, group = name, linetype = treat)
    ) +
    geom_line() +
    labs(x = "GC content", colour = "Patient", linetype = "Treatment")
b2 <- cqNorma_tumor$func2 %>%
    as.data.frame() %>%
    mutate(x = cqNorma_tumor$grid2) %>%
    pivot_longer(
        cols = any_of(colnames(dge_tumor)),
        names_to = "name",
        values_to = "QR fit"
    ) %>%
    left_join(dge_tumor$samples) %>%
    ggplot(
        aes(x, `QR fit`, colour = patient, group = name, linetype = treat)
    ) +
    geom_line() +
    labs(
        x = expression(paste(log[10], " Gene Length (kb)")),
        colour = "Patient", linetype = "Treatment"
    )
plot_grid(
        a2 + theme(legend.position = "none"), 
        b2 + theme(legend.position = "none"),
        nrow = 1)
*Model fits used when applying CQN in all tumor samples. The divergent samples previously noted on the PCA are again quite divergent here. In particular, the long genes with high GC content appear to be where the primary differences are found, in keeping with the previous PCA analysis.*

Model fits used when applying CQN in all tumor samples. The divergent samples previously noted on the PCA are again quite divergent here. In particular, the long genes with high GC content appear to be where the primary differences are found, in keeping with the previous PCA analysis.

pcaPost_tumor <- logCPM_tumor %>%
  t() %>%
  prcomp() 

The strong inter-patient heterogeneity is sitll evident even post-normalization.

g1 <- pcaPost_tumor %>%
    autoplot(
        data = dge_tumor$samples, 
        x = 1, y = 2,
        colour = "treat", 
        shape = "treat",
        size = 3
    ) +
    facet_wrap(~patient, ncol = 4) +
    scale_colour_manual(values= treat_cols) +
    scale_shape_manual(values = treat_shapes) +
    labs(
        colour = "Treatment",
        shape = "Treatment"
    )
g2 <- pcaPost_tumor %>%
    autoplot(
        data = dge_tumor$samples, 
        x = 1, y = 2,
        colour = "treat", 
        shape = "treat",
        size = 3
    ) +
    scale_colour_manual(values= treat_cols) +
    scale_shape_manual(values = treat_shapes) +
    labs(
        colour = "Treatment",
        shape = "Treatment"
    )
# pdf(file = here::here("figure/pcaPostNor_tumor.pdf"),
#     width = 12,
#     height = 5)
plot_grid(g1, g2, 
          ncol = 2, 
          rel_widths = c(1.3,1), 
          scale = c(0.9, 0.8), 
          labels = c("(a)", "(b)"))
*PCA on logCPM after performing CQN for (a) each of, (b) all of the 8 tumor patients. No common direction based on treatment appears evident with each patient. Great inter-patient heterogenity still exists even post-normalisation.*

PCA on logCPM after performing CQN for (a) each of, (b) all of the 8 tumor patients. No common direction based on treatment appears evident with each patient. Great inter-patient heterogenity still exists even post-normalisation.

# dev.off()
dge_tumor$genes %>%
  dplyr::select(gene_id, ave_tx_len, gc_content) %>%
  mutate(
    GC = cut(
      x = gc_content,
      labels = seq_len(10),
      breaks = quantile(gc_content, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    Length = cut(
      x = ave_tx_len,
      labels = seq_len(10),
      breaks = quantile(ave_tx_len, probs = seq(0, 1, length.out = 11)),
      include.lowest = TRUE
    ),
    bin = paste(GC, Length, sep = "_"),
    `pre-CQN` = pca_tumor$rotation[gene_id, "PC1"],
    `post-CQN` = pcaPost_tumor$rotation[gene_id, "PC1"]
  ) %>%
  pivot_longer(
    cols = c("pre-CQN", "post-CQN"),
    names_to = "status",
    values_to = "value"
  ) %>%
  mutate(status = factor(status, levels = c("pre-CQN", "post-CQN"))) %>%
  group_by(status, GC, Length, bin) %>%
  summarise(
    Size = n(),
    mean = mean(value),
    sd = sd(value),
    t = t.test(value)$statistic,
    p = t.test(value)$p.value,
    FDR = p.adjust(p, method = "bonf")
  ) %>%
  ggplot(
    aes(Length, GC, colour = t, alpha = -log10(FDR), size = Size)
  ) +
  geom_point() +
  facet_wrap(~status) +
  scale_colour_gradient2() +
  scale_size_continuous(range = c(1, 10)) +
  theme(
    panel.grid = element_blank(),
    legend.position = "bottom"
  )
*Contributions of gene length and GC content to PC1 shown both before and after CQN among tumor samples.. The strong patterns seen before normalisation have clearly been reduced. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values.*

Contributions of gene length and GC content to PC1 shown both before and after CQN among tumor samples.. The strong patterns seen before normalisation have clearly been reduced. Fill colours indicate the t-statistic, with tranparency denoting significance as -log10(p), using Bonferroni-adjusted p-values.

Heterogeneity

Because of the high inter-patient heterogeneity observed, correlation between the first three PCA components and sample metadata was tested.

As shown in the figure below, pathologist’s labels on the tumours strongly correlated with the first three PCA components.

pcaPost_tumor$x %>%
    as.data.frame() %>%
    rownames_to_column("name") %>%
    left_join(tumor_sample) %>%
    dplyr::select(PC1, PC2, PC3, patient, treat, Diagnosis) %>%
    dplyr::rename(
                  Treatment = treat, 
                  Patient = patient
                  ) %>%
    mutate_all(as.numeric) %>%
    cor() %>% 
    corrplot(
        type = "lower", 
        diag = FALSE, 
        addCoef.col = 1, 
        col = rev(COL2('RdBu', 200)),
        addCoefasPercent = TRUE
    )
*Correlations between the first three principal components and sample metadata in tumor samples. Treatment, patient, and pathologist diagnosis were converted to an ordered categorical variable for the purposes of visualisation.*

Correlations between the first three principal components and sample metadata in tumor samples. Treatment, patient, and pathologist diagnosis were converted to an ordered categorical variable for the purposes of visualisation.

DE Analysis

formatP <- function(p, m = 0.0001){
out <- rep("", length(p))
out[p < m] <- sprintf("%.2e", p[p<m])
out[p >= m] <- sprintf("%.4f", p[p>=m])
out
}

Average treatment responses among tumor samples were tested through fitting quasi-likelihood negative binomial GLMs through edgeR.

X_tumor <- model.matrix(~ 0 + patient + treat, data = dge_tumor$samples %>%
                            droplevels()) %>%
  set_colnames(str_remove_all(colnames(.), "treat"))
dge_tumor <- estimateDisp(dge_tumor, design = X_tumor, robust = TRUE)
fit_tumor <- glmQLFit(dge_tumor)
colnames(X_tumor) <- make.names(colnames(X_tumor))
contrast_tumor <- makeContrasts(DHT = DHT,
                          E2 = E2,
                          `E2+DHT` = E2.DHT,
                          `E2+DHTvsDHT` = E2.DHT - DHT,
                          `E2+DHTvsE2` = E2.DHT - E2,
                          levels = X_tumor)
alpha <- 0.05
topTables_tumor <- colnames(contrast_tumor) %>%
    str_subset("patient", TRUE) %>%
    sapply(function(x){
        glmQLFTest(fit_tumor, contrast = contrast_tumor[,x]) %>%
            topTags(n = Inf) %>%
            .[["table"]] %>%
            as_tibble() %>%
            mutate(
                coef = x,
                location = paste0(seqnames, ":", start, "-", end, ":", strand),
                rankingStat = -sign(logFC)*log10(PValue),
                signedRank = rank(rankingStat),
                DE = FDR < alpha
            ) %>%
            dplyr::select(
                gene_id, gene_name, logCPM, logFC, PValue, FDR, coef,
                location, gene_biotype, entrezid, ave_tx_len, gc_content,
                rankingStat, signedRank, DE
            )
    },
    simplify = FALSE)
# saveRDS(topTables_tumor, here::here("output/topTables_tumor.rds"))

No DEG was detected in any of the treatment or contrast group:

topTables_tumor %>%
    lapply(dplyr::filter, FDR < alpha) %>%
    vapply(nrow, integer(1)) %>%
    set_names(c("DHT vs Veh", "E2 vs Veh","E2+DHT vs Veh", "E2+DHT vs DHT", "E2+DHT vs E2" )) %>%
    pander()
DHT vs Veh E2 vs Veh E2+DHT vs Veh E2+DHT vs DHT E2+DHT vs E2
0 0 0 0 0

The full DE analysis results are available in the table below:

topTables_tumor %>%
    # lapply(dplyr::filter, FDR < alpha) %>%
    lapply(dplyr::select, gene_name, logFC, FDR, coef, DE) %>%
    bind_rows() %>%
    mutate_if(is.numeric, signif, 2) %>%
    mutate_at(vars(contains(c("gene_name", "coef"))), as.factor) %>%
    mutate(Direction = ifelse(logFC < 0, "Down", "Up")) %>%
    dplyr::rename(`Gene name` = gene_name, 
                  `Comparision` = coef) %>%
    datatable(
        filter = "top",
        options = list(scrollY = '5500px')) %>% 
    formatStyle(
        c("logFC", "Direction"),
        color = styleEqual(c("Down", "Up"), c('blue', 'red'))
    )

Thesis figure

Code used to generate figures for the thesis.

tumor_sample <- tumor_sample %>%
    .[match(rownames(pcaPost_tumor$x), .$name),]
pcaPost_tumor_df <- pcaPost_tumor$x %>%
    as.data.frame() %>%
    rownames_to_column("name") %>%
    left_join(
        tumor_sample
    )
thesis_1b <- pcaPost_tumor_df %>%
    ggplot(
        aes(PC1, PC2)
    ) +
    geom_point(aes( color = patient, shape = treat), 
               size = 4) +
    scale_shape_manual(values = treat_shapes) +
    labs(
        colour = "Patient",
        fill = "Patient",
        shape = "Treatment"
    ) +
    geom_mark_ellipse(
        aes(
            label = patient,
            fill = patient
        ), 
        label.fontsize = 24
    ) +
    guides(color = FALSE, fill = FALSE) +
    theme(
        panel.grid = element_blank(), 
        text = element_text(size = 40)
    )
# png("~/PhD_thesis/draft_figure/chapter_04/Figure1B.png",
#     width = 1200,
#     height = 600)
# thesis_1b
# dev.off()
# png("~/PhD_thesis/draft_figure/chapter_04/Figure1C.png",
#     width = 1000,
#     height = 1000)
# pcaPost_tumor$x %>%
#    as.data.frame() %>%
#    rownames_to_column("name") %>%
#    left_join(tumor_sample) %>%
#    dplyr::select(PC1, PC2, PC3, patient, treat, Age, Diagnosis) %>%
#    dplyr::rename(
#                  Treatment = treat,
#                  Patient = patient,
#                  `Pathology Diagnosis` = Diagnosis
#                  ) %>%
#    mutate(Age = as.numeric(as.character(Age)),
#           across(-"Age", as.numeric)) %>%
#    cor() %>%
#    corrplot(
#        type = "lower",
#        diag = FALSE, addCoef.col = "black",
#        col = rev(COL2('RdBu', 200)),
#        addCoefasPercent = TRUE, 
#        number.cex = 2, 
#        tl.cex = 2, 
#        tl.col = "black", 
#        cl.cex = 2
#    )
# dev.off()

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Australia/Adelaide
tzcode source: internal

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] ggforce_0.4.1           reactable_0.4.4         WGCNA_1.72-1           
 [4] fastcluster_1.2.3       dynamicTreeCut_1.63-1   goseq_1.52.0           
 [7] geneLenDataBase_1.36.0  BiasedUrn_2.0.10        broom_1.0.5            
[10] msigdbr_7.5.1           singscore_1.20.0        corrplot_0.92          
[13] UpSetR_1.4.0            BiocParallel_1.34.2     DT_0.28                
[16] ggrepel_0.9.3           cqn_1.46.0              quantreg_5.95          
[19] SparseM_1.81            preprocessCore_1.62.1   nor1mix_1.3-0          
[22] mclust_6.0.0            magrittr_2.0.3          ggfortify_0.4.16       
[25] cowplot_1.1.1           ensembldb_2.24.0        AnnotationFilter_1.24.0
[28] GenomicFeatures_1.52.1  AnnotationDbi_1.62.2    Biobase_2.60.0         
[31] GenomicRanges_1.52.0    GenomeInfoDb_1.36.1     IRanges_2.34.1         
[34] S4Vectors_0.38.1        AnnotationHub_3.8.0     BiocFileCache_2.8.0    
[37] dbplyr_2.3.3            BiocGenerics_0.46.0     edgeR_3.42.4           
[40] limma_3.56.2            glue_1.6.2              pander_0.6.5           
[43] scales_1.2.1            yaml_2.3.7              lubridate_1.9.2        
[46] forcats_1.0.0           stringr_1.5.0           dplyr_1.1.2            
[49] purrr_1.0.1             readr_2.1.4             tidyr_1.3.0            
[52] tibble_3.2.1            ggplot2_3.4.2           tidyverse_2.0.0        

loaded via a namespace (and not attached):
  [1] later_1.3.1                   BiocIO_1.10.0                
  [3] bitops_1.0-7                  filelock_1.0.2               
  [5] polyclip_1.10-4               graph_1.78.0                 
  [7] rpart_4.1.19                  XML_3.99-0.14                
  [9] lifecycle_1.0.3               doParallel_1.0.17            
 [11] rprojroot_2.0.3               vroom_1.6.3                  
 [13] lattice_0.21-8                MASS_7.3-60                  
 [15] crosstalk_1.2.0               backports_1.4.1              
 [17] Hmisc_5.1-0                   sass_0.4.6                   
 [19] rmarkdown_2.23                jquerylib_0.1.4              
 [21] httpuv_1.6.11                 DBI_1.1.3                    
 [23] zlibbioc_1.46.0               RCurl_1.98-1.12              
 [25] nnet_7.3-19                   tweenr_2.0.2                 
 [27] rappdirs_0.3.3                git2r_0.32.0                 
 [29] GenomeInfoDbData_1.2.10       MatrixModels_0.5-2           
 [31] annotate_1.78.0               codetools_0.2-19             
 [33] DelayedArray_0.26.6           xml2_1.3.5                   
 [35] tidyselect_1.2.0              farver_2.1.1                 
 [37] base64enc_0.1-3               matrixStats_1.0.0            
 [39] GenomicAlignments_1.36.0      jsonlite_1.8.7               
 [41] Formula_1.2-5                 ellipsis_0.3.2               
 [43] iterators_1.0.14              survival_3.5-5               
 [45] foreach_1.5.2                 tools_4.3.0                  
 [47] progress_1.2.2                Rcpp_1.0.11                  
 [49] gridExtra_2.3                 here_1.0.1                   
 [51] xfun_0.39                     mgcv_1.9-0                   
 [53] MatrixGenerics_1.12.2         withr_2.5.0                  
 [55] BiocManager_1.30.21           fastmap_1.1.1                
 [57] fansi_1.0.4                   digest_0.6.33                
 [59] timechange_0.2.0              R6_2.5.1                     
 [61] mime_0.12                     colorspace_2.1-0             
 [63] GO.db_3.17.0                  biomaRt_2.56.1               
 [65] RSQLite_2.3.1                 utf8_1.2.3                   
 [67] generics_0.1.3                data.table_1.14.8            
 [69] rtracklayer_1.60.0            prettyunits_1.1.1            
 [71] httr_1.4.6                    htmlwidgets_1.6.2            
 [73] S4Arrays_1.0.4                pkgconfig_2.0.3              
 [75] gtable_0.3.3                  blob_1.2.4                   
 [77] impute_1.74.1                 workflowr_1.7.0              
 [79] XVector_0.40.0                htmltools_0.5.5              
 [81] GSEABase_1.62.0               ProtGenerics_1.32.0          
 [83] png_0.1-8                     knitr_1.43                   
 [85] rstudioapi_0.15.0             tzdb_0.4.0                   
 [87] reshape2_1.4.4                rjson_0.2.21                 
 [89] checkmate_2.2.0               nlme_3.1-162                 
 [91] curl_5.0.1                    cachem_1.0.8                 
 [93] BiocVersion_3.17.1            parallel_4.3.0               
 [95] foreign_0.8-84                restfulr_0.0.15              
 [97] pillar_1.9.0                  grid_4.3.0                   
 [99] vctrs_0.6.3                   promises_1.2.0.1             
[101] cluster_2.1.4                 xtable_1.8-4                 
[103] htmlTable_2.4.1               evaluate_0.21                
[105] cli_3.6.1                     locfit_1.5-9.8               
[107] compiler_4.3.0                Rsamtools_2.16.0             
[109] rlang_1.1.1                   crayon_1.5.2                 
[111] labeling_0.4.2                plyr_1.8.8                   
[113] fs_1.6.2                      stringi_1.7.12               
[115] babelgene_22.9                munsell_0.5.0                
[117] Biostrings_2.68.1             lazyeval_0.2.2               
[119] Matrix_1.6-0                  hms_1.1.3                    
[121] bit64_4.0.5                   statmod_1.5.0                
[123] KEGGREST_1.40.0               shiny_1.7.4.1                
[125] highr_0.10                    SummarizedExperiment_1.30.2  
[127] interactiveDisplayBase_1.38.0 memoise_2.0.1                
[129] bslib_0.5.0                   bit_4.0.5